Processing data using sequential dependencies

ABSTRACT

Methods and apparatus for processing data using sequential dependencies are disclosed herein. An example method includes modifying a first number of values in a sequence of a data set to generate a modified sequence such that each difference between each successive pair of values is within a threshold. A satisfiability metric is determined for the modified sequence based on a relationship between a number of modifications to the values in the sequence and a size of the sequence.

RELATED APPLICATIONS

This patent arises from a continuation of U.S. patent application Ser. No. 12/592,586, entitled “PROCESSING DATA USING SEQUENTIAL DEPENDENCIES,” which was filed on Nov. 30, 2009, and which is hereby incorporated herein by reference in its entirety.

FIELD OF THE DISCLOSURE

This disclosure relates to processing large volumes of data to reveal data reliability in conforming to selected categories of ordered attributes. It invokes sequential dependencies to express the ordered attributes.

BACKGROUND

Interesting data sets often contain attributes with ordered domains: timestamps, sequence numbers, surrogate keys, measured values such as sales, temperature and stock prices, etc. Understanding the semantics of such data is an important practical problem, both for data quality assessment and for knowledge discovery. However, integrity constraints such as functional and inclusion dependencies do not express any ordering properties.

BRIEF DESCRIPTION OF THE DRAWINGS

The disclosure may be better understood when considered in conjunction with the drawing in which:

FIG. 1 shows tableaux for an SD time→_((0,∞)) count;

FIG. 2 shows tableaux for an SD pollnum→_(r) time;

FIG. 3 illustrates adjusting marginal cardinalities;

FIGS. 4-9 show various tableau sizes for data representing DOWJONES averages; and

FIGS. 10-15 illustrate scalability for various data sets.

DETAILED DESCRIPTION

Interesting data sets often contain attributes with ordered domains: timestamps, sequence numbers, surrogate keys, measured values such as sales, temperature and stock prices, etc. Understanding the semantics of such data is an important practical problem, both for data quality assessment and for knowledge discovery. However, integrity constraints such as functional and inclusion dependencies do not express any ordering properties. In this patent, we study sequential dependencies for ordered data and present a framework for discovering which subsets of the data obey a given sequential dependency.

Given an interval G, a sequential dependency (SD) on attributes X and Y, written as X→G Y, denotes that the distance between the Y-values of any two consecutive records, when sorted on X, is within G. SDs of the form X→_((0,∞)) Y and X→_((−∞,0]) Y specify that Y is strictly increasing and non-increasing, respectively, with X, and correspond to classical Order Dependencies (ODs). They are useful in data quality analysis (e.g., sequence numbers must be increasing over time) and data mining (in a business database, delivery date increases with shipping date; in a sensor network, battery voltage increases with temperature; etc.). SDs generalize ODs and can express other interesting relationships between ordered attributes. An SD of the form sequence number→_([4,5]) time specifies that the time “gaps” between consecutive sequence numbers are between 4 and 5. In the context of data quality, SDs can measure the quality of service of a data feed that is expected to arrive with some frequency, e.g., a stock ticker that should generate updated stock prices every 4 to 5 minutes. In terms of data mining, the SD date→_((20,∞)) price identifies stock prices that rapidly increase from day to day (by at least 20 points).

In practice, even “clean” data may contain outliers. We characterize the degree of satisfaction of an SD by a given data set via a confidence measure. Furthermore, real data sets, especially those with ordered attributes, are inherently heterogeneous, e.g., the frequency of a data feed varies with time of day, measure attributes fluctuate over time, etc. Therefore, Conditional Sequential Dependencies (CSDs) are proposed, which extend SDs analogously to how Conditional Functional Dependencies (CFDs) extend traditional Functional Dependencies (FDs).

A CSD consists of an underlying SD plus a representation of the subsets of the data that satisfy this SD. Similar to CFDs, the representation used here is a tableau, but the tableau rows are intervals on the ordered attributes.

Internet Service Providers (ISPs) collect various network performance statistics, such as the number of packets flowing on each link. These measurements are maintained by routers in the form of cumulative counters, which are probed periodically by a data collection system. A plot of packet counts versus time is shown in FIG. 1. While the counts are expected to increase over time, counters are finite (e.g., 32 bits) and thus periodically loop around. Furthermore, counters reset whenever the router is rebooted. Additionally, spurious measurements may appear (e.g., at time 16 in FIG. 1), such as when the data collector probes the wrong router. Due to the cyclic nature of the counters, the semantics of this data set cannot be captured by the SD time→_((0,∞)) count; we need a conditional SD whose tableau identifies subsets that satisfy the embedded SD. For instance, each pattern in Tableau A from FIG. 1 corresponds to an interval that exactly satisfies the embedded SD. Alternatively, a small number of violations may be allowed in order to produce more informative tableaux and help avoid “overfitting” the data. Tableau B from FIG. 1 contains two patterns that capture the two mostly-increasing fragments of the data set (with one violation at time 16). It not only identifies the intervals over which the SD is obeyed but also pinpoints the time at which there is a disruption in the ordering (at time 11). Such tableaux are useful tools for concisely summarizing the data semantics and identifying possible problems with the network or the data collector, e.g., a tableau with many “short” patterns suggests premature counter roll-over.

An ISP may also be interested in auditing the polling frequency. The data collector may be configured to probe the counters every ten seconds; more frequent polls may indicate problems at the collector (it may be polling the same router multiple times) while missing data may be caused by a misconfigured collector or a router that is not responding to probes. A possible sequence of measurement times (not the actual counter values) is shown in FIG. 2, sorted in polled order, along with a tableau (labeled Tableau A) for the embedded SD pollnum→_([9,11]) time, which asserts that the gaps between adjacent polls should be between 9 and 11 seconds. Again, each pattern is allowed a small number of violations to better capture the trends in the data; e.g., the first pattern [10, 90] contains one gap of length 20.

Furthermore, testing related SDs with different gap ranges reveals intervals that violate the expected semantics. For example, pollnum→_([20,∞)) time finds subsequences with (mostly) long gaps, as shown in Tableau B. Similarly, pollnum→_([0,10)) time detects periods of excessively frequent measurements. The corresponding tableaux provide concise representations of subsets that deviate from the expected semantics, and are easier for a user to analyze than a raw (possibly very lengthy) list of all pairs of records with incorrect gaps. It is worth noting that simply counting the number of polls to detect problems is insufficient: if the window size for counts is too small (say, ten seconds), then false positives can occur if polls arrive slightly late; if the window size is too large (say, one hour), then false negatives can occur due to missing and extraneous data “canceling each other out”.

A basic aspect of the disclosure is an integrity constraint for ordered data. The mechanisms generating ordered data often provide the order semantics: sequence numbers are increasing, measurements arrive every ten seconds, etc. However, finding subsets of the data obeying the expected semantics is laborious to do manually. We therefore assume that the embedded SD has been supplied and solve the problem of discovering a “good” pattern tableau. An objective is parsimonious tableaux that use the fewest possible patterns to identify a large fraction of the data (“support”) that satisfy the embedded SD with few violations (“confidence”). The technical basis for this is a framework for CSD tableau discovery, which involves generating candidate intervals and constructing a tableau using a smallest subset of candidate intervals (each of which has sufficiently high confidence) that collectively “cover” the desired fraction of the data.

In this model, every tableau pattern must independently satisfy the embedded SD. The brute force algorithm computes the confidence of all Θ(N²) possible intervals (in a sequence of N elements) and identifies as candidates those which have a sufficiently high confidence. Since the goal is to discover a concise tableau, large intervals that cover more data are preferred, and therefore candidate intervals that are contained in larger candidate intervals may be ignored. An initial observation is that CSDs obey a “prefix” property, whereby the confidences of all prefixes of a given interval I are incrementally computed en route to computing the confidence of I itself. Thus, it suffices to compute the confidence of the N intervals [i,N], where 1≦i≦N, and, for each i, find the maximum j such that the interval [i,j] has the required confidence.

A second observation is that CSDs also satisfy a “containment” property, which implies that the confidence of an interval slightly larger than some interval I must be similar to that of I. An approximation algorithm may be formulated that computes the confidence of a small set of carefully chosen intervals such that, for each candidate interval I identified by the exact algorithm, the algorithm is guaranteed to identify a slightly larger interval with a confidence not significantly lower than that of I. Instead of computing the confidence of the N intervals described above, the approximation algorithm only needs to compute the confidence of O((log N)/δ) intervals, where 1+δ is a bound on the approximation error.

In addition to improving the efficiency of the candidate generation phase, this framework improves the efficiency of the tableau construction step. This step solves the partial interval cover problem by choosing the fewest candidate intervals that cover the desired fraction of the data. An exact dynamic programming algorithm for this problem takes quadratic time in the number of candidate intervals. A linear-time and linear-space greedy heuristic is given and is proven to return tableaux with sizes within a constant factor (of nine) of the optimal solution.

To summarize, among the main contributions are the following.

Conditional Sequential Dependencies are proposed as novel integrity constraints for ordered data, and efficient algorithms are given for testing their confidence.

A general framework is given that makes the discovery of “good” tableaux for CSDs computationally feasible, provided that the confidence measure satisfies the prefix and containment properties.

Experimental results are given demonstrating the efficiency (order-of-magnitude improvement over the brute force approach), as well as the utility (in revealing useful data semantics and data quality problems), of the proposed framework on a wide range of real data sets.

Definitions:

Let S be a relational schema on attributes A₁, A₂, . . . , A_(k) with relation instance R={t₁, t₂, . . . , t_(N)}. Let dom(X)={t₁[X], t₂[X], . . . , t_(N)[X]} refer to the set of domain values over X, where t[X] denotes the relation tuple t projected on the attributes X. The input to the problem is modeled as a relation, some of whose attributes have ordered domains.

DEFINITION 1 Let X and Y, X ⊆ S and Y ⊆ S, be two attribute sets, G be an interval, and π be the permutation of rows of R increasing on X (that is, t_(π(1))[X]<t_(π(2))[X]<. . . <t_(π(N))[X]).

A sequential dependency (SD) X→G Y is said to hold over R if for all i such that 1≦i≦N−1, t_(π(i+1))[Y]−t_(π(i))[Y] ∈ G.

That is, when sorted on X, the gaps between any two consecutive Y-values must be within G.

X is referred to as the antecedent of the SD and Y as the consequent. It is assumed that total orderings exist on X and Y, and that there is a mapping f( ) which linearizes the different combinations of attribute values in X and Y into integers. For example, if X={hour, minute, second} then the tuple t[X]=(h,m,s) could be mapped via f(h,m,s)=3600h+60m+s.
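Definition 1 and the linearizing map can be illustrated with a minimal Python sketch. The function names `linearize_hms` and `sd_holds`, the representation of records as (x, y) pairs, and the open/closed endpoint flags are assumptions made for this illustration only; they are not part of the disclosure.

```python
import math
from typing import Iterable, Tuple


def linearize_hms(h: int, m: int, s: int) -> int:
    """Map an (hour, minute, second) antecedent to one integer: 3600h + 60m + s."""
    return 3600 * h + 60 * m + s


def sd_holds(records: Iterable[Tuple[float, float]],
             lo: float, hi: float,
             lo_open: bool = False, hi_open: bool = False) -> bool:
    """Check whether the SD X ->G Y holds exactly.

    records: (x, y) pairs with distinct, already linearized x-values.
    G is the interval from lo to hi; the flags mark open endpoints.
    """
    # Sort on the antecedent X, then inspect each consecutive gap in Y.
    ys = [y for _, y in sorted(records, key=lambda r: r[0])]
    for prev, cur in zip(ys, ys[1:]):
        gap = cur - prev
        too_small = gap <= lo if lo_open else gap < lo
        too_large = gap >= hi if hi_open else gap > hi
        if too_small or too_large:
            return False
    return True


# Hypothetical poll times with one missing poll around time 50.
polls = list(enumerate([10, 20, 30, 40, 60, 70, 80, 90], start=1))
print(sd_holds(polls, 9, 11))       # the gap of 20 violates [9, 11]
print(sd_holds(polls[:4], 9, 11))   # the first four polls satisfy it
print(sd_holds(polls, 0, math.inf, lo_open=True, hi_open=True))  # strictly increasing
```

This exact test only answers yes or no; the confidence measure introduced next quantifies how far a sequence is from satisfying the SD.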

In practice, an SD may not hold exactly: when ordered on X, the resulting sequence of Y-values may not have the correct gaps. Previous work on characterizing the extent to which ODs, FDs, and CFDs hold on a given relation instance employed a “deletion-based” metric that determines the largest possible subset of the relation that satisfies the constraint. Using this measure, the confidence of interval [11, 20] from FIG. 1 with respect to the SD time→_((0,∞)) count is 9/10 since the largest subset that makes the SD valid contains every record except the one at time 16. The confidence of the entire data set, i.e., the interval [1, 20], is 10/20 since the largest satisfying subset contains (the first) ten points.

Now consider the interval [10, 90] from FIG. 2. To satisfy the SD pollnum→_([9,11]) time, select either the first four points or the last four points in this interval, for a confidence of 4/8. This confidence value seems low since this interval has only one “problem”, namely a missing record around time 50. Thus the confidence of a CSD is defined using the following edit distance metric, which is a natural extension of the known deletion metric.

DEFINITION 2 The confidence of a suggested SD over a given relation instance (or subset thereof) of size N is (N−OPS)/N, where OPS is the smallest possible number of records that need to be inserted or deleted to make the SD hold.

Note that confidence cannot be negative since in the worst case, we can “delete” all but one record, which will trivially satisfy the SD. This metric has several useful properties. It is robust to occasional missing data: in the above example, the interval [10, 90] has a confidence of 7/8 since only one edit operation (insertion) needs to be made to satisfy the SD. It is also robust to spurious values. Returning to the above example, the sequence <10, 20, 30, 1000, 40> has a relatively high confidence of 4/5 since it suffices to delete the suspicious element 1000. Furthermore, the metric penalizes based on gap sizes, unlike just counting the fraction of “bad gaps” (i.e., those not in the specified gap range). For example, if all gaps are expected to be between 3 and 5, then a gap of 6 can be corrected by one insertion, but a gap of size 1000 requires 199 insert operations.

Having defined the confidence of an SD, computing it (i.e., computing OPS) on a relation instance is now described.

Consider a “simple” SD of the form X→_((0,∞)) Y, which requires Y to be increasing with X. Note that this SD does not limit the maximum gap length, so new records are not needed to reduce the lengths of oversized gaps. Its confidence may be computed from the length of the longest increasing subsequence on Y, after ordering the relation on X. More formally, let π be the permutation of rows of R increasing on X. We wish to find a longest subsequence π(i₁), π(i₂), . . . , π(i_(T)) of π, i₁<i₂<. . . <i_(T), such that t_(π(i₁))[Y]<. . . <t_(π(i_(T)))[Y], for some T≦N. Let S_(N) be the sequence <t_(π(1))[Y], . . . , t_(π(N))[Y]>. The length (not the subsequence itself) of the longest increasing subsequence of S_(N) is denoted by LIS(S_(N)). Then the confidence of an SD on R is LIS(S_(N))/N, which can be computed in O(N log N) time. In general, SDs of the form X→_([G,∞)) Y, G a finite non-negative integer, can be handled in a similar way, by finding longest sequences increasing by at least G at every step. We note that other measures of “sortedness” may be natural for some applications (such as those based on number of inversions, average inversion length or “satisfaction within bounds”) and could be used in place of this quantity throughout this description, and can be computed within the same time complexity by the given framework.
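The LIS-based confidence for the simple SD X→_((0,∞)) Y can be sketched as follows. This uses the standard patience-sorting technique with binary search, which runs in O(N log N) time; the function names and the (x, y) record layout are assumptions made for the example.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


def lis_length(values: Iterable[float]) -> int:
    """Length of the longest strictly increasing subsequence."""
    # tails[k] holds the smallest possible last value of a strictly
    # increasing subsequence of length k + 1 seen so far.
    tails: List[float] = []
    for v in values:
        pos = bisect_left(tails, v)
        if pos == len(tails):
            tails.append(v)      # v extends the longest subsequence found
        else:
            tails[pos] = v       # v is a smaller tail for length pos + 1
    return len(tails)


def increasing_sd_confidence(records: List[Tuple[float, float]]) -> float:
    """Confidence LIS(S_N)/N of the SD X ->(0, inf) Y on (x, y) records."""
    ys = [y for _, y in sorted(records, key=lambda r: r[0])]
    return lis_length(ys) / len(ys)


# Deleting the spurious value 1000 leaves an increasing sequence of length 4.
print(increasing_sd_confidence(list(enumerate([10, 20, 30, 1000, 40]))))
```

Because `bisect_left` is used, equal values never extend a subsequence, which matches the strict inequality required by the open interval (0,∞).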

SDs of the form X→_([G1,G2]) Y are now considered, where 0≦G₁≦G₂≠0. A sequence (of Y-values mapped to integers, when sorted on X) is valid if it is non-empty, all elements are integers, and all its gaps are between G₁ and G₂. Computing the confidence requires finding OPS(N), the minimum number of integers that must be added to or deleted from the length-N sequence in order to obtain a valid sequence. For example, the confidence of an SD with G₁=4 and G₂=6 on the sequence <5, 9, 12, 25, 31, 30, 34, 40> is 1−4/8=½. Deleting 12 and inserting 15 and 20 in its place (or deleting 5, 9 and 12) and then deleting 31 will convert the sequence into a valid one, and no series of three or fewer insertions and deletions will make the sequence valid. In general, the sequence need not be sorted, i.e., some gaps may be negative.

Given a sequence <a₁, a₂, . . . , a_(N)> of integers, for i=1, 2, . . . , N let v=a_(i) and define T(i) to be the minimum number of insertions and deletions one must make to <a₁, a₂, . . . , a_(i)> in order to convert it into a valid sequence ending in the number v. (Note that since the value v might appear more than once in the sequence, one might get a sequence ending in a copy of v which is not the ith element in the sequence.) Now computing OPS(N) from the T(i)'s can be done as follows: OPS(N)=min_(0≦r≦N−1){r+T(N−r)}, as proven in Theorem 3.

THEOREM 3 The minimum number OPS(i) of insertions and deletions required to convert an input sequence S_(i) into a valid one is given by min_(0≦r≦i−1){r+T(i−r)}. Furthermore, OPS(i) can be calculated inductively by OPS(1)=0 and OPS(i)=min{1+OPS(i−1), T(i)} for all i≧2.

PROOF. First, we prove that OPS(i)≧min_(0≦r≦i−1){r+T(i−r)}. In the optimal transformation, let r be the exact number of terms at the end of the sequence S_(i)=<a₁, a₂, . . . , a_(i)> which are removed; hence, a_(i−r) remains and appears in the final sequence. Clearly, 0≦r≦i−1. After deleting the last r terms, the optimal algorithm must transform the prefix consisting of the first i−r terms into a valid sequence ending in a_(i−r). The cost to do this is T(i−r), and hence the optimal total cost is r+T(i−r). Since there is some r, 0≦r≦i−1, such that OPS(i)=r+T(i−r), it can be inferred that OPS(i)≧min_(0≦r≦i−1){r+T(i−r)}. Clearly OPS(i)≦min_(0≦r≦i−1){r+T(i−r)} as well, since for each such r one could get a valid sequence by deleting the last r integers and then, at cost T(i−r), converting the sequence <a₁, a₂, . . . , a_(i−r)> into a valid sequence ending in the value a_(i−r). The second statement follows from OPS(i)=min_(0≦r≦i−1){r+T(i−r)} by splitting off the r=0 case from the 1≦r≦i−1 case.

In order to show how to compute the T(i)'s, a definition of and a lemma about dcost are needed; dcost is a function which specifies the fewest integers that should be appended to a length-1 sequence to get a valid sequence whose last element is exactly d larger than its first.

DEFINITION 4 Define dcost(d), for d=0, 1, 2, . . . , to be the minimum number of integers one must append to the length-1 sequence <0> to get a valid sequence ending in d, and ∞ if no such sequence exists.

It is nontrivial but not hard to prove the following lemma, whose proof is omitted here for simplicity.

LEMMA 5 If G₁=0, then dcost(d)=⌈d/G₂⌉. Otherwise, dcost(d)=⌈d/G₂⌉ if ⌈(d+1)/G₁⌉>⌈d/G₂⌉ and ∞ otherwise.

For example, if G₁=4 and G₂=6, then dcost(7)=∞. Furthermore, dcost(8)=2, uniquely obtained with two gaps of length 4. This is interesting since one might be tempted to infer from “dcost(d)=⌈d/G₂⌉” that all but one gap have length G₂.
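Lemma 5 translates directly into a few lines of Python. The sketch below is illustrative only; the name `dcost` follows the text, while `insertions_to_bridge` is a hypothetical helper that reports how many integers must be inserted strictly between two existing values whose difference is d.

```python
import math


def dcost(d: int, g1: int, g2: int) -> float:
    """Minimum number of integers appended to <0> to reach d with gaps in [g1, g2].

    Follows Lemma 5; requires 0 <= g1 <= g2 and g2 > 0. Returns math.inf
    when no valid sequence ending in d exists.
    """
    steps = -(-d // g2)                  # ceil(d / g2) in integer arithmetic
    if g1 == 0:
        return steps
    # Feasible exactly when ceil((d + 1) / g1) > ceil(d / g2).
    return steps if -(-(d + 1) // g1) > steps else math.inf


def insertions_to_bridge(d: int, g1: int, g2: int) -> float:
    """Integers to insert between two kept values that differ by d > 0."""
    return dcost(d, g1, g2) - 1


print(dcost(7, 4, 6))                    # no valid sequence: infinite cost
print(dcost(8, 4, 6))                    # two gaps of length 4
print(insertions_to_bridge(1000, 3, 5))  # the large-gap example from the text
```

The last call reproduces the earlier observation that, with gaps expected between 3 and 5, a gap of 1000 needs 199 insertions while a gap of 6 needs only one.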

LEMMA 6 Choose an i, 1≦i≦N. Let ν=a_(i). Then among all ways to convert <a₁, a₂, . . . , a_(i)> into a valid sequence ending in the number ν, there is one in which the ith symbol is never deleted.

Keep in mind that ν=a_(i) may appear more than once in the sequence <a₁, a₂, . . . , a_(i)>. If one generates a valid sequence ending in the value ν, just which ν is it? The ν which is the ith symbol in the sequence? Or the ν which is the jth, for some j<i with a_(j)=a_(i)=ν? The content of this lemma is that there is always a minimum-cost way of transforming the sequence into a valid sequence in which ν is the ith symbol, not the jth.

PROOF. If the ith symbol is deleted, let j be the largest index of a nondeleted symbol (which must exist). Clearly a_(j)≦a_(i), since in the final list all integers are at most ν=a_(i). If a_(j)<a_(i), then the algorithm must at some point append an a_(i), but then it was wasteful to delete the ith integer in the first place, and so it should not have. Hence it may be assumed that a_(j)=a_(i). Now instead of deleting the ith symbol and not deleting the jth, delete the jth and do not delete the ith.

THEOREM 7 Having computed T(1), T(2), . . . , T(i−1), for some i≦N, T(i) may be computed using the existing T(1), . . . , T(i−1) as follows. Define

min₁:=i−1,

min₂:=min_(j:j<i,aj<ai){T(j)+(i−1−j)+[dcost(a_(i)−a_(j))−1]},

and define

min₃:=min_(j:j<i,aj=ai){T(j)+(i−1−j)}.

Then, T(i)=min {min₁,min₂} if G₁>0 and T(i)=min {min₁,min₂,min₃} if G₁=0.

PROOF. Choose i, let ν=a_(i), and consider an optimal sequence of moves which converts <a₁, a₂, . . . , a_(i)> into a valid sequence whose last entry is ν. By Lemma 6, it may be assumed that the optimal sequence of moves does not delete the ith entry. Either the optimal sequence deletes the first i−1 entries or it does not. If it does, the cost is min₁=i−1. If it does not, let j be the maximum index less than i such that the jth symbol is not deleted. Clearly a_(j+1), a_(j+2), . . . , a_(i−1), a total of i−1−j integers, are deleted.

If G₁>0, then, since a_(i) is not deleted, a_(j)<a_(i). The adversary, who converts the input sequence into a valid sequence using the fewest operations, will then “bridge the gap” from a_(j) to a_(i), and convert <a₁, . . . , a_(j)> into a valid sequence ending at a_(j), at a cost of T(j). Given a length-2 integral sequence <y, z>, y≦z, the number of integers one must insert between y and z to get a valid sequence (i.e., to “bridge the gap” from y to z) is

0 if y=z and G₁=0,

∞ if y=z and G₁>0, and

dcost(z−y)−1 if y<z.

Hence, the total cost is (i−1−j)+(dcost(a_(i)−a_(j))−1)+T(j).

If G₁=0, there is the additional possibility that a_(j)=a_(i). The cost of bridging the gap is zero, for a total cost of (i−1−j)+T(j).
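The recurrences of Theorems 3 and 7 can be combined into a straightforward quadratic-time sketch. This is the naive evaluation that tries every j<i, not the faster band-based method; the function names `ops` and `sd_confidence` are chosen for this example and are not part of the disclosure.

```python
import math
from typing import List


def dcost(d: int, g1: int, g2: int) -> float:
    """Lemma 5: fewest integers appended to <0> to reach d, or inf if impossible."""
    steps = -(-d // g2)                       # ceil(d / g2)
    if g1 == 0:
        return steps
    return steps if -(-(d + 1) // g1) > steps else math.inf


def ops(seq: List[int], g1: int, g2: int) -> int:
    """Minimum insertions plus deletions that make seq valid for gaps in [g1, g2]."""
    n = len(seq)
    T = [0] * n       # T[i]: cost to turn seq[:i+1] into a valid sequence ending in seq[i]
    for i in range(n):
        best = i      # min1: delete every earlier element (i of them, 0-based)
        for j in range(i):
            deleted = i - 1 - j               # elements strictly between j and i
            if seq[j] < seq[i]:               # min2: bridge the gap from a_j to a_i
                bridge = dcost(seq[i] - seq[j], g1, g2) - 1
                best = min(best, T[j] + deleted + bridge)
            elif seq[j] == seq[i] and g1 == 0:    # min3: zero gap allowed only if G1 = 0
                best = min(best, T[j] + deleted)
        T[i] = best
    # Theorem 3: delete the last r elements, then end at element n - r.
    return min(r + T[n - 1 - r] for r in range(n))


def sd_confidence(seq: List[int], g1: int, g2: int) -> float:
    """Definition 2: (N - OPS) / N."""
    return (len(seq) - ops(seq, g1, g2)) / len(seq)


print(ops([5, 9, 12, 25, 31, 30, 34, 40], 4, 6))            # worked example from the text
print(sd_confidence([10, 20, 30, 40, 60, 70, 80, 90], 9, 11))
```

On the worked example with G₁=4 and G₂=6 this yields four operations, matching the confidence of ½ stated earlier, and the poll sequence with one missing record yields 7/8.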

Having a recurrence for computing the T(i)'s allows one to calculate all the T(i)'s quickly. If, for each a_(i), every a_(j)-value with j<i is evaluated for the recurrence, then the algorithm will run in linear time for each i, or quadratic time in total. However, it is possible, for each i, to find the best j without trying all the linearly-many j's. The idea here is that the dcost values are either finite or infinite. Clearly any term having an infinite dcost can be ignored. The observation is that the infinite dcost values come in a limited number of consecutive blocks, and hence the finite dcost values also come in a limited number of consecutive blocks (all but one of which have finite size), which we call bands. It can be shown how to compute the minimum over one band, and therefore, for each i, the time to compute T(i) will be bounded by the product of the number of bands and the time per band. The overall time will be just N times this product.

Given some gap range [G₁,G₂], the bands of finite values for dcost are the input value ranges [k G₁, k G₂], for integers k≧1. Note that these bands widen with increasing k (if G₁<G₂). Indeed, when k becomes large enough, the bands will overlap and, therefore, no more dcost values of ∞ will occur for d this large. Exactly when will the overlap first occur? There is no space for a d with dcost(d)=∞ between the band [l G₁, l G₂] and the next band [(l+1) G₁, (l+1) G₂] if and only if (l+1) G₁≦l G₂+1, i.e., l≧⌈(G₁−1)/(G₂−G₁)⌉ (if G₁≠G₂). The case where G₁=G₂ is treated separately below.
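The band structure can be checked numerically with a short sketch. The helper names `overlap_start` and `infinite_gaps` are hypothetical, and the code assumes integers with 1≦G₁<G₂; it simply enumerates the bands [k G₁, k G₂] rather than implementing the search over them.

```python
from typing import List


def overlap_start(g1: int, g2: int) -> int:
    """Smallest band index l >= 1 with (l + 1) * g1 <= l * g2 + 1.

    From band l onward, consecutive bands leave no room for a d with
    infinite dcost. Requires g1 < g2.
    """
    return max(1, -(-(g1 - 1) // (g2 - g1)))   # ceil((g1 - 1) / (g2 - g1))


def infinite_gaps(g1: int, g2: int, limit: int) -> List[int]:
    """All d in 1..limit lying in no band [k * g1, k * g2], k >= 1 (needs g1 >= 1)."""
    return [d for d in range(1, limit + 1)
            if not any(k * g1 <= d <= k * g2 for k in range(1, d + 1))]


# With gaps in [4, 6] the bands are [4, 6], [8, 12], [12, 18], ...
print(overlap_start(4, 6))        # band index from which no holes remain
print(infinite_gaps(4, 6, 50))    # values of d that no band covers
```

For G₁=4 and G₂=6 only a handful of small differences are unreachable, and all of them lie below l·G₁, which is why the number of bands that need separate handling stays small.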

Given a fixed a_(i), the formula for T(i) requires that we compute dcost(a_(i)−a_(j)); hence, we wish to find the values of a_(j) for which dcost(a_(i)−a_(j)) is finite. Since dcost(d) is finite within bands k G₁≦d≦k G₂ for each k, substituting d=a_(i)−a_(j) and solving for a_(j) yields bands a_(i)−k G₂≦a_(j)≦a_(i)−k G₁. So the bands with respect to a_(j) are now [a_(i)−G₂, a_(i)−G₁], [a_(i)−2G₂, a_(i)−2G₁], . . . , [a_(i)−(l−1)G₂, a_(i)−(l−1)G₁] and one band of infinite length [−∞, a_(i)−l G₁]. Since the a_(j)'s come from sequence element values, clearly we never need to consider a_(j)-values less than the smallest value a_(min) in the sequence. Thus, we can threshold any band extending below a_(min), ensuring that no band is of infinite length (i.e., if a_(min) lies within [−∞, a_(i)−l G₁] then this band gets truncated to [a_(min), . . . , a_(i)−l G₁]) and possibly resulting in fewer than l bands to search. Note that, since in each of these bands dcost is finite, dcost(d) is equivalently defined as ⌈d/G₂⌉. Furthermore, since 0≦⌈x⌉−x<1 for all x, we can substitute the function d/G₂ in place of ⌈d/G₂⌉ and obtain the same result, because all the other variables are integers so adding a fractional amount less than 1 will not change the rank order for the best a_(j).

Here is how the algorithm proceeds. For a fixed i, in any band (with finite dcost) arg min_(j:j<i,aj<ai){T(j)+(i−1−j)+[dcost(a_(i)−a_(j))−1]} is equivalent to arg min_(j:j<i,aj<ai){T(j)−j−a_(j)/G₂}. So for each band k (1≦k≦l), we find j(k)=arg min_(j){T(j)−j−a_(j)/G₂} subject to a_(j) ∈ [a_(i)−k G₂, a_(i)−k G₁], or subject to a_(j) ∈ [a_(min), a_(i)−k G₁] if a_(i)−k G₂<a_(min). Let j* be the best j from among these bands, that is, the j(k) with the smallest value of T(j)−j−a_(j)/G₂. Then min₂=T(j*)+(i−1−j*)+[dcost(a_(i)−a_(j*))−1]. We also need to consider the j's for which a_(j)=a_(i). So we let j*=arg min_(j:j<i,aj=ai){T(j)−j−a_(j)/G₂} and min₃=T(j*)+(i−1−j*). Finally, we take T(i)=min {min₁,min₂} if G₁>0 and T(i)=min {min₁,min₂,min₃} if G₁=0.

For the case of G₁=G₂=G, given some integer G>0, the algorithm is simpler and can be computed in O(N log N) time. The idea is to partition the sequence elements a_(j) into G classes 0, 1, . . . , G−1, based on their (mod G)-values. Then, given a_(i), we search only the a_(j)'s for which a_(j)≡a_(i) (mod G) and a_(j)<a_(i), and take the j with smallest T(j)−j−a_(j)/G as j*. Clearly, j* can be found in O(log N) time. As usual, we let min₂=T(j*)+(i−1−j*)+[dcost(a_(i)−a_(j*))−1].

THEOREM 8 The confidence of an SD X→_([G1,G2]) Y on a sequence of length N can be computed in time O((G₂/(G₂−G₁)) N log N) when G₁≠G₂ and in time O(N log N) when G₁=G₂.

PROOF. For each of N sequence elements, we search in at most (G₁/(G₂−G₁))+1=G₂/(G₂−G₁) bands for the arg min, and each band can be searched and updated in O(log N) time using a standard data structure for range-min over arbitrary ranges of values. In fact, we can afford to first sort the sequence element values, thus transforming them into their ranks, and store the min over each dyadic interval in rank-space. That way, the ranges can be transformed into being over a universe of size N (i.e., the ranks), which makes updates much easier, and a range-min can be stored for every possible binary partition of the values with respect to their ranks. Then range query intervals can be decomposed into O(log N) adjacent dyadic intervals, from which the result can be obtained. The total query time is the product of these, O((G₂/(G₂−G₁)) N log N).
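The range-min structure over rank space mentioned in the proof can be sketched with an ordinary iterative segment tree. This is one standard way to realize such a structure and is not claimed to be the implementation intended by the disclosure; the class name `RangeMin` and the helper `band_to_rank_range` are assumptions. In use, the stored key for index j would be T(j)−j−a_(j)/G₂, placed at the rank of a_(j).

```python
import math
from bisect import bisect_left, bisect_right
from typing import List, Tuple


class RangeMin:
    """Point-update, range-minimum structure over ranks 0..size-1."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.tree = [math.inf] * (2 * size)    # leaves live at size..2*size-1

    def update(self, rank: int, value: float) -> None:
        """Lower the key stored at `rank` to `value` if it is smaller."""
        pos = rank + self.size
        self.tree[pos] = min(self.tree[pos], value)
        pos //= 2
        while pos >= 1:                        # refresh ancestors up to the root
            self.tree[pos] = min(self.tree[2 * pos], self.tree[2 * pos + 1])
            pos //= 2

    def query(self, lo: int, hi: int) -> float:
        """Minimum key over ranks lo..hi inclusive; inf if the range is empty."""
        best = math.inf
        lo += self.size
        hi += self.size + 1                    # switch to a half-open range
        while lo < hi:
            if lo & 1:
                best = min(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = min(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


def band_to_rank_range(sorted_values: List[int], lo_value: int, hi_value: int) -> Tuple[int, int]:
    """Translate a band [lo_value, hi_value] of element values into a rank range."""
    return bisect_left(sorted_values, lo_value), bisect_right(sorted_values, hi_value) - 1
```

Each update and each query touches O(log N) tree nodes, which matches the per-band cost assumed in the proof. An empty band maps to a rank range whose lower end exceeds its upper end, for which `query` returns infinity.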

DEFINITION 9 A Conditional Sequential Dependency (CSD) is a pair φ=(X→G Y, Tr), where X→G Y is referred to as the embedded SD and Tr is a “range pattern tableau” which defines over which rows of R the dependency applies. Each pattern tr ∈ Tr specifies a range of values of X that identify a subset of R (subsequence on X). The CSD states that, for each tr ∈ Tr, the embedded SD independently holds over the subset of the relation (subsequence on X) identified by tr.

Let [t_(π(i))[X], t_(π(j))[X]] be the interval represented by a tableau pattern tr; again, we let π be the permutation of rows in R sorted on X. We define the confidence of tr as the confidence of its interval w.r.t. the embedded SD, the support of tr as the number of records contained in its interval, i.e., j−i+1, and the position interval of tr as [i, j] (for example, the position interval of the pattern [30, 60] from Tableau B in FIG. 2 is [3, 5]). We also define the total support, or global support, of a CSD as the support of the union of the intervals identified by the tableau patterns (note that patterns may overlap).
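The support definitions can be made concrete with a small sketch. Position intervals are taken as 1-based inclusive pairs (i, j); the function names and the sample tableau are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]   # 1-based, inclusive position interval [i, j]


def support(interval: Interval) -> int:
    """Local support of one pattern: the number of records j - i + 1."""
    i, j = interval
    return j - i + 1


def global_support(intervals: List[Interval]) -> int:
    """Number of records covered by the union of possibly overlapping intervals."""
    covered = 0
    current_end = 0                       # highest position counted so far
    for start, end in sorted(intervals):
        if end <= current_end:
            continue                      # fully inside what is already counted
        covered += end - max(start, current_end + 1) + 1
        current_end = end
    return covered


# A hypothetical tableau whose first two patterns overlap.
tableau = [(1, 5), (3, 8), (10, 12)]
print([support(p) for p in tableau])
print(global_support(tableau))
```

Summing the local supports would double-count positions 3 through 5 here, which is why global support is defined on the union.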

The goal of tableau discovery is to find a parsimonious tableau whose patterns all provide sufficient confidence and describe a sufficient portion of the data. Thus, given a relation instance and an embedded SD, we wish to find a smallest tableau (if any exists) subject to confidence and (global) support threshold constraints.

DEFINITION 10 The CSD Tableau Discovery Problem is, given a relation instance R, an embedded SD X→G Y, a global support threshold ŝ and a confidence threshold ĉ, to find a tableau Tr of minimum size such that the CSD φ=(X→G Y, Tr) has a global support at least ŝ and that each tr ∈ Tr has confidence at least ĉ.

Naturally, one could optionally impose a local support threshold that is met by each tableau pattern, in order to ensure that spurious and uninteresting patterns are not reported. Furthermore, rather than seeking a tableau with a sufficiently high global support, it may be useful to ask for the k “best” patterns (e.g., those having the highest local support) regardless of the global support.

A general tableau discovery framework may be posed. It is assumed that the confidence of an interval I containing N points may be written as ƒ(I)/N, where ƒ is some aggregate function, and that 0≦ƒ(I)≦N to ensure that confidence is between zero and one. For the confidence metric, ƒ(I)=N−OPS and 1≦ƒ(I)≦N since more than N−1 edit operations are never needed. The framework consists of two phases: (1) generating candidate intervals and (2) choosing from these candidates a small subset providing suitable (global) support to be used for the tableau. What makes the first phase inherently challenging is that the confidence of an interval may not be readily composed from those of its subintervals due to the complex nature of whatever aggregate function is employed in measuring confidence. Take FIG. 1 for example. The confidence of the interval [1, 10] is 1, the confidence of [11, 20] is 0.9, but the confidence of [1, 20] is only 0.5. However, the following properties can be exploited.

DEFINITION 11 An aggregate function ƒ over a sequence is said to satisfy the prefix property if the time to compute ƒ on all prefixes of a sequence is no more than a constant factor greater than the time to compute it on the sequence itself. Hence the prefix property is a property of the algorithm computing ƒ rather than ƒ itself. Formally, we are given some time bound G(N) and we need to assume that ƒ can be computed on all N prefixes of a sequence of length N in time G(N), in total.

DEFINITION 12 An aggregate function ƒ is said to satisfy the containment property if for any sequence σ and subsequence τ appearing in consecutive positions of σ, ƒ(τ)≦ƒ(σ).

First, the given framework can be used to speed up interval generation with any confidence measure whose aggregate function ƒ obeys both the prefix property and the containment property. Emphasis will be on developing scalable algorithms (i.e., running in time N polylog N). The framework uses the confidence measure from Definition 2.

Only intervals satisfying the supplied confidence threshold are considered as tableau candidates. Given a choice between any two candidates, where one is contained in the other, choosing the smaller one may unnecessarily increase the size of the tableau. Hence, for each i, we seek the maximum j≧i (if any) such that the position interval [i, j] has confidence at least ĉ (in the remainder of this section, position intervals will be referred to as intervals unless otherwise noted). There are at most N such intervals as there is at most one with each given left endpoint. (One could go further and remove all intervals contained in others.)
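The exact candidate generation step can be sketched for the simple SD X→_((0,∞)) Y, whose aggregate function is the LIS length. The sketch relies on the prefix property: while scanning from a fixed left endpoint i, the LIS length of every prefix [i, j] is available incrementally. The function names are assumptions for this illustration, and the overall cost is quadratic up to a logarithmic factor.

```python
from bisect import bisect_left
from typing import List, Tuple


def maximal_candidates(ys: List[float], c_hat: float) -> List[Tuple[int, int]]:
    """For each left endpoint i, the largest j with confidence of [i, j] >= c_hat.

    ys are the Y-values already sorted on X; returned positions are 1-based.
    """
    n = len(ys)
    candidates = []
    for i in range(n):
        tails: List[float] = []           # patience-sorting state for ys[i..j]
        best_j = None
        for j in range(i, n):
            pos = bisect_left(tails, ys[j])
            if pos == len(tails):
                tails.append(ys[j])
            else:
                tails[pos] = ys[j]
            # len(tails) is the LIS length of ys[i..j]; the interval has j - i + 1 records.
            if len(tails) >= c_hat * (j - i + 1):
                best_j = j                # keep scanning: confidence is not monotone in j
        if best_j is not None:
            candidates.append((i + 1, best_j + 1))
    return candidates


def drop_contained(candidates: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Remove candidates contained in an earlier one (left endpoints are increasing)."""
    kept, max_right = [], 0
    for left, right in candidates:
        if right > max_right:
            kept.append((left, right))
            max_right = right
    return kept


# A hypothetical counter that resets after the third reading.
counts = [10, 20, 30, 5, 15, 25]
print(drop_contained(maximal_candidates(counts, 1.0)))
print(drop_contained(maximal_candidates(counts, 0.75)))
```

Lowering the threshold lets the candidates absorb the reset point, so the surviving intervals grow and overlap, which is the trade-off between tableau size and confidence discussed above.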

A naive way to find candidate intervals would be to compute the confidence of all N(N+1)/2 possible intervals between 1 and N. Using the prefix property, this can be improved by a factor of N by computing confidence over the intervals [1 . . . N], [2 . . . N], . . . , [N . . . N] and using intermediate results. Unfortunately, this is still too expensive for large data sets if computing the confidence on an interval of length l requires Ω(l) time, as it will require Ω(N²) time to find all maximal intervals. How can we find these intervals without testing all (i, j) pairs? The trick, at the price of “cheating” on the confidence (as described below), is to test only a proper subset of the pairs, but enough so that, for any interval I chosen by an adversary (i.e., any interval which could appear in an optimal tableau), our set of candidate intervals contains one, J, which contains I and whose length is only slightly larger, specifically, |J|≦(1+ε)|I|. Any aggregate function ƒ satisfying the containment property will satisfy ƒ(J)≧ƒ(I), and hence the confidence of J, ƒ(J)/|J|, will be at least ƒ(I)/|J|≧ƒ(I)/[(1+ε)|I|]=(ƒ(I)/|I|)/(1+ε), i.e., at least 1/(1+ε) times as large as the confidence of I. Thus, by “cheating” on confidence (but only by the small factor 1/(1+ε)), we can ensure that every adversary interval is (barely) covered by some candidate interval.

An approximation algorithm may be given for efficiently generating candidate intervals. The algorithm takes a real ε>0 and builds a set of real intervals in [0,N], with the following property. For any subinterval I of [0,N] of length at least 1, among the intervals generated by the algorithm is an interval J which contains I and whose length is at most 1+ε times the length of I.

Now the intervals are generated. Choose a small positive δ with a value to be determined later. For each length of the form l_(h)=(1+δ)^(h), for h=0, 1, 2, . . . , until (1+δ)^(h) first equals or exceeds N, build a family of intervals each of length l_(h), with left endpoints starting at 0, δl_(h), 2δl_(h), 3δl_(h), . . . , in total, about N/(δl_(h)) intervals.
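The construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the names candidate_intervals and has_cover are hypothetical, intervals are kept as real-valued (left, right) pairs, and intervals that would run past N are clipped to N (a choice made here for convenience).

```python
def candidate_intervals(n, delta):
    """Build real intervals in [0, n] with lengths (1 + delta)^h.

    For each length l_h, left endpoints are 0, delta*l_h, 2*delta*l_h, ...
    Intervals running past n are clipped to n (an assumption of this sketch).
    """
    intervals = []
    h = 0
    while True:
        length = (1.0 + delta) ** h
        step = delta * length  # spacing between consecutive left endpoints
        k = 0
        while k * step < n:
            left = k * step
            intervals.append((left, min(left + length, n)))
            k += 1
        if length >= n:  # stop once (1 + delta)^h first equals or exceeds n
            break
        h += 1
    return intervals


def has_cover(intervals, a, b, eps):
    """True if some candidate contains [a, b] and is at most (1 + eps) times as long."""
    return any(
        left <= a and b <= right and (right - left) <= (1.0 + eps) * (b - a)
        for left, right in intervals
    )
```

For example, with N=50, ε=0.5 and δ=0.15, has_cover(candidate_intervals(50, 0.15), 3.3, 5.6, 0.5) looks for a candidate that contains [3.3, 5.6] and has length at most 3.45.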

How much time will it take to compute the confidence of each such interval? Compute the sum of the lengths of the intervals, and multiply at the end by G(N)/N. For each of the log_(1+δ) N values of h, there are N/(δl_(h)) intervals, each of length l_(h); hence the sum of their lengths is N/δ. It follows that the total length over all intervals is the number of h's, i.e., log_(1+δ) N, times N/δ. Since log_(1+δ) N is approximately (lg N)/δ for small δ, the product is (N lg N)/δ².

However, we can do better. So far we have used only the containment property; now we use the prefix property. We modify the intervals' design so that many will have the same left endpoint. Break the intervals into groups according to their lengths: those with lengths in [1, 2), those with lengths in [2, 4), those with lengths in [4, 8), etc. There are obviously lg N groups. Within a group, our intervals have length l_(h) for varying h's; their left endpoints are multiples of δl_(h). We now change their left endpoints as follows. For intervals with lengths in [A, 2A), now make the left endpoints multiples of δA≦δl_(h) (rather than δl_(h)), shrinking the gap between consecutive left endpoints and enlarging the number of intervals by less than a factor of 2. However, note the following important fact: all the intervals with lengths in [A, 2A) start at 0, δA, 2δA, 3δA, . . . By the prefix property, it suffices to include in the running time only the length of the longest interval with a given starting point. Hence we can process all the intervals with lengths in [A, 2A) in time G(N)/N multiplied by O((N/(δA))(2A)), which is G(N)/N times O(N/δ). Since there are only lg N such groups (and not log_(1+δ) N, as before), the total time to process all intervals will be G(N)/N times O((N lg N)/δ). Hence, for LIS computation, for example, for which G(N)/N is O(log N), the overall time will be O((N lg² N)/δ).

THEORY 13 Let ℐ be the set of intervals in an optimal solution, each having confidence at least č, and θ be the set of intervals considered by our algorithm. For each I ∈ ℐ, there exists a J ∈ θ with confidence ≧((1−δ)/(1+δ)) č containing I.

PROOF. How small a δ must be used such that, for any interval I=[a, b] ⊂ [0,N] of length at least 1, one of our intervals contains I and has length at most 1+ε times as large? Choose h smallest such that l_(h)−δl_(h)≧b−a, i.e., l_(h)≧(b−a)/(1−δ). Then one of our intervals starts at a or no more than δl_(h) to the left of a, and ends at or to the right of b. That interval clearly contains I. By minimality of h, l_(h−1)<(b−a)/(1−δ), and therefore the length (1+δ)^(h)=(1+δ)l_(h−1) of our interval is at most (1+δ)/(1−δ) times the length of I, proving Theory 13. Theory 13 implies that it suffices to choose δ small enough that (1+δ)/(1−δ)≦1+ε, i.e., δ≦ε/(2+ε). (For brevity, some implementation details on converting the real intervals into sets of contiguous integers have been omitted.)
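The last step of the proof, from the length bound to the choice of δ, can be spelled out as a short derivation. It uses only the symbols δ and ε from the proof and assumes 0<δ<1, so that multiplying by 1−δ preserves the inequality.

```latex
\begin{align*}
\frac{1+\delta}{1-\delta} \le 1+\varepsilon
  &\iff 1+\delta \le (1+\varepsilon)(1-\delta)
        = 1 + \varepsilon - \delta - \varepsilon\delta \\
  &\iff 2\delta + \varepsilon\delta \le \varepsilon \\
  &\iff \delta \le \frac{\varepsilon}{2+\varepsilon}.
\end{align*}
```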

Given a set of intervals in [0,N] satisfying the confidence threshold, each with integral endpoints and no two with the same left endpoint, we can assemble a tableau T_(r) with support at least ŝ by selecting enough intervals to cover the desired number of points; in particular, we wish to choose the minimal number of intervals needed. Each selected (position) interval [i . . . j] then determines the tableau pattern [t_(π(i))[X], t_(π(j))[X]], i.e., the position interval mapped back to a range of X-values. We first show that, unlike the more general PARTIAL SET COVER problem, our problem is in P, by exploiting the fact that we have intervals rather than arbitrary sets. We give an O(N²)-time dynamic programming algorithm to find a minimal (partial) cover. The algorithm takes as input a set θ of intervals of the form [i . . . j]={i, i+1, . . . , j}, for some 1≦i≦j≦N, and assumes they are sorted on their left endpoints. Via dynamic programming, the algorithm computes, for each 0≦k, l≦N, the value T(k, l), which equals the minimum number of the given intervals necessary to cover at least k points among {1, 2, . . . , l} (or ∞ if it is not possible to do so); the final answer is T(⌈ŝN⌉, N). The base cases are T(0, 0)=0 and T(k, 0)=∞ for all k>0. After T(k, l′) has been computed for all l′<l and all k=0, 1, 2, . . . , N, the algorithm computes T(k, l) for all k=0, 1, 2, . . . , N, using Lemma 14.

LEMMA 14 If there is no input interval containing l, then T(k, l)=T(k, l−1). Otherwise, among all intervals containing l, choose the one whose left endpoint is smallest; denote its left endpoint by l−z+1. Then

T(k, l)=min{T(k, l−1), 1+T(k−z, l−z)}.

PROOF. As the first statement is obvious, we move on to the second. The optimal way to cover at least k of the points 1, 2, . . . , l either covers the point l or it does not. If it does not, its cost is T(k, l−1). If it does, it contains some interval which contains l. Without loss of generality it contains, among those intervals containing l, the one whose left endpoint is as small as possible. Suppose that this interval has left endpoint l−z+1 and therefore covers the z points l−z+1, l−z+2, . . . , l. Then its cost is T(k−z, l−z)+1.

Lemma 14 suggests an easy O(N²)-time algorithm for computing all the T(k, l) values. Since the quadratic complexity of the dynamic programming algorithm makes it infeasible for large data sets, we consider an approximation to find a nearly minimal size using a greedy algorithm for PARTIAL SET COVER. We show that, for the special case in which the sets are intervals, the algorithm can be implemented in linear time and provides a constant performance ratio.
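The recurrence of Lemma 14 can be sketched as a small dynamic program. This is a minimal illustration, not the implementation evaluated later. The function name min_partial_cover is hypothetical, and the sketch assumes that a negative first argument k−z is treated as 0 (nothing further needs to be covered), which the text does not state explicitly.

```python
import math


def min_partial_cover(n, intervals, k_target):
    """Minimum number of intervals covering at least k_target points of 1..n.

    intervals: list of (i, j) with 1 <= i <= j <= n. Returns math.inf if impossible.
    """
    # best_left[l] = smallest left endpoint among intervals containing point l
    best_left = [None] * (n + 1)
    for i, j in intervals:
        for l in range(i, j + 1):
            if best_left[l] is None or i < best_left[l]:
                best_left[l] = i

    # table[l][k] plays the role of T(k, l); T(0, 0) = 0 and T(k, 0) = inf for k > 0
    table = [[math.inf] * (n + 1) for _ in range(n + 1)]
    table[0][0] = 0
    for l in range(1, n + 1):
        for k in range(n + 1):
            best = table[l - 1][k]  # option 1: point l is left uncovered
            if best_left[l] is not None:
                z = l - best_left[l] + 1  # points l-z+1..l are covered
                # option 2: use the interval reaching furthest left from l;
                # a negative k - z is clamped to 0 (assumption of this sketch)
                best = min(best, 1 + table[l - z][max(0, k - z)])
            table[l][k] = best
    return table[n][k_target]
```

With the intervals [1..4], [3..6], [6..10] and [8..9] over N=10, covering at least 7 points needs two intervals (for example [6..10] and [1..4]), while covering all 10 points needs three.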

THEORY 15 The greedy partial set cover algorithm can be implemented to run in time O(N).

PROOF. A set of intervals is given sorted on left (and also right) endpoints by the candidate generation phase. We separately maintain these intervals ordered by set cardinality in an array 1 . . . N of linked lists, where the array index corresponds to cardinality. At each step, we iterate down (initially from N) to the largest index containing a non-empty linked list, to find an interval with the largest “marginal” cardinality (which only counts points that have not already been covered by an interval that has already been added to the tableau), and adjust the marginal cardinalities of any overlapping intervals. Consider the intervals shown in FIG. 3(a) and suppose that the longest one has just been added to the tableau. As seen in FIG. 3(b), six intervals need to have their marginal cardinalities updated. Further, of these six intervals, which are now shorter, four are now contained in other ones and may be deleted. In general, each iteration of the algorithm deletes all but one interval intersecting the left endpoint of the currently longest interval; likewise for the right endpoint. Since there are at most N iterations and we adjust at most two intervals per iteration, the time spent adjusting the nondeleted intervals is N*O(1)=O(N). The total time spent deleting intervals, over the entire execution of the algorithm, is O(N), since there are at most N intervals.
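The greedy rule itself (repeatedly take the interval with the largest marginal cardinality) can be illustrated with a deliberately simple sketch. It does not use the array of linked lists from the proof and therefore does not achieve the O(N) bound; it recomputes marginal cardinalities by brute force. The names marginal_gain and greedy_partial_cover are hypothetical.

```python
def marginal_gain(interval, covered):
    """Number of points of the interval not yet covered by chosen intervals."""
    i, j = interval
    return sum(1 for p in range(i, j + 1) if not covered[p])


def greedy_partial_cover(n, intervals, k_target):
    """Greedily pick intervals until at least k_target of the points 1..n are covered.

    Returns (chosen intervals, number of points covered). May stop early if the
    candidates run out before the target is reached.
    """
    covered = [False] * (n + 1)
    remaining = list(intervals)
    chosen = []
    num_covered = 0
    while num_covered < k_target and remaining:
        best = max(remaining, key=lambda iv: marginal_gain(iv, covered))
        gain = marginal_gain(best, covered)
        if gain == 0:  # no remaining interval adds a new point
            break
        remaining.remove(best)
        chosen.append(best)
        for p in range(best[0], best[1] + 1):
            covered[p] = True
        num_covered += gain
    return chosen, num_covered
```

On the intervals [1..4], [3..6], [6..10] and [8..9] over N=10 with a target of 7 points, the sketch first picks [6..10] (five new points) and then [1..4] (four new points), matching the size of the optimal cover in this small case.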

THEORY 16 The greedy algorithm gives a constant performance ratio.

An important property of our framework is that the size of a generated tableau can be no larger than the tableau generated when there is no cheating on confidence in the candidate interval phase, given the same confidence threshold. This is easy to see because cheating on confidence can only yield intervals subsuming optimal intervals, and with better choices available an optimal (partial) set cover will be at most as large.

To give examples of confidence metrics, first, we show that our tableau generation framework is compatible with our definition of confidence (Definition 2). In the special case of “simple” CSDs, we need to compute the length of a longest increasing subsequence (LIS) in a given interval in order to compute its confidence. Many implementations of longest increasing subsequence incrementally maintain LIS on increasing prefixes in O(N log N) time; hence, LIS satisfies the prefix property. As for the containment property, clearly if one interval is contained in another, then any subsequence of the smaller interval must be contained in the larger. Therefore, for simple CSDs, our framework is able to find candidates in O((N log² N)/δ) time. While there is prior work on simultaneously computing LIS's of multiple (overlapping) windows of a sequence, none of this work breaks the quadratic complexity barrier. Recent work on computing the approximate size of a longest increasing subsequence on streams saves space but not time. Hence, we are not aware of a faster way to compute LIS that can help in our context.

The dynamic program given above provides values at every prefix en route to computing the confidence of the entire interval, thus satisfying the prefix property. The containment property is also satisfied because the same valid gap sequence converted from an interval would also be available to any interval containing it; it would require no more deletions than the difference in the lengths to transform the larger interval into the same valid gap sequence. So for general CSDs, our framework is able to find candidates in O((G₂/(G₂−G₁)) (N log² N)/δ) time. If one prefers to define confidence differently, such as based on the average number of inversions for SDs of the form X→_([0,∞)) Y, or based on the fraction of gaps within [G₁, G₂] for SDs of the form X→_([G₁, G₂]) Y with G₂<∞, then our framework also applies.
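The prefix property for LIS can be illustrated with the standard binary-search (patience sorting) technique, which yields the LIS length of every prefix in a single O(N log N) pass. The sketch below assumes a strictly increasing subsequence and takes the confidence of a prefix to be its LIS length divided by its length, as described above for simple CSDs; the function names are hypothetical.

```python
from bisect import bisect_left


def prefix_lis_lengths(values):
    """LIS length (strictly increasing) of every prefix of values, in O(N log N)."""
    tails = []  # tails[m] = smallest possible tail of an increasing subsequence of length m+1
    lengths = []
    for v in values:
        pos = bisect_left(tails, v)
        if pos == len(tails):
            tails.append(v)  # v extends the longest subsequence found so far
        else:
            tails[pos] = v  # v gives a smaller tail for length pos+1
        lengths.append(len(tails))
    return lengths


def prefix_confidences(values):
    """Confidence of each prefix, taken here as LIS length / prefix length."""
    return [lis / (i + 1) for i, lis in enumerate(prefix_lis_lengths(values))]
```

For the sequence 1, 2, 3, 2, 4, 5, 1, 6 the prefix LIS lengths are 1, 2, 3, 3, 4, 5, 5, 6, so the confidence of the whole sequence under this measure is 6/8=0.75.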

An experimental evaluation follows of the proposed tableau discovery framework for conditional sequential dependencies, which comprises candidate interval generation (CANDGEN) and tableau assembly (TABASSMB). First, to justify the motivation and utility of CSDs, we present sample tableaux which unveil interesting data semantics and potential data quality problems. Second, for both CANDGEN and TABASSMB, we investigate the trade-off between tableau quality and performance of resorting to approximation. Finally, we demonstrate the efficiency and scalability of the proposed tableau generation framework.

Experiments were performed on a 2.7 GHz dual-core Pentium PC with 4 GB of RAM. The performance numbers presented are based on real time as reported by the Unix time command. Experiments were run 5 times and the average time was reported. All algorithms were implemented in C++. We used the following four data sources for our experiments. Table 1 displays a summary of data characteristics.

-   DOWJONES consists of daily closing figures of the Dow Jones Industrial Average, and has the schema (DATE, AVGCLOSING). The closing figures have been smoothed using a 2-week moving window average.
-   WEATHERDATES consists of the days on which daily temperatures were recorded at Gabreski Airport in Long Island, N.Y., from 1943.07.18-2008.10.01 by Global Summary of the Day.
-   NETWORKFEEDS consists of data feeds of probed measurements from an ISP and the associated timestamps when they were received.
-   TRAFFICPOLLS contains the timestamps of traffic volume measurements in an ISP that were configured to be taken every 5 minutes.

TABLE 1 Summary of Data Sources

DATASET        #TUPLES   DEPENDENCY
DOWJONES       27399     DATE → (0, ∞) AVGCLOSING
NETWORKFEEDS   916961    STARTTIME → (0, ∞) ENDTIME
WEATHERDATES   15716     ARRIVALORDER → [0, 1] DATE
TRAFFICPOLLS   91522     ARRIVALORDER → [270, 330] TIME

In the experiments that follow, we use the confidence threshold č=0.995, support threshold ŝ=0.5 (note that the tableau assembly algorithm may terminate before reaching the required support if it runs out of candidate patterns), and approximation tolerance parameter δ=0.05, unless mentioned otherwise.

We first show that CSDs with different gap values can capture interesting semantics. We also show that our approximate framework discovers tableaux that are close to optimal. Table 2 compares tableaux generated by exhaustive candidate generation (EXACTINTVL) and our approximate candidate generation (APPRXINTVL), for various gaps with greedy TABASSMB on the WEATHERDATES dataset. The support of each pattern is also shown, indicating the number of data values contained in the corresponding interval. Gap ranges of [0, 1] (at least one temperature reading per day) and [0, 2] (one reading every two days) result in tableaux with two rows, indicating that there was at least one major break in the data recording.

TABLE 2 Tableau sizes for various gap values on WEATHERDATES

EXACTINTVL Tableau      Support   APPRXINTVL Tableau      Support

Gap: [0, 1]
Tableau size: 2                   Tableau size: 2
1945.11.06-1969.12.15   6819      1944.02.07-1981.01.23   7536
1980.12.09-1990.12.10   3636      1981.02.05-2006.02.04   6999

Gap: [0, 2]
Tableau size: 2                   Tableau size: 2
1945.11.01-1969.12.15   6824      1944.02.07-1981.01.29   7542
1980.10.22-1991.01.23   3681      1981.02.05-2006.07.31   7176

Gap: [0, 5]
Tableau size: 2                   Tableau size: 1
1945.10.29-1969.12.15   6827      1943.07.18-2008.05.25   15588
1980.11.23-1995.05.10   5115

Gap: [2, 10]
Tableau size: 20                  Tableau size: 20
1995.06.23-1995.06.28   3         1995.06.23-1995.06.28   3
1983.02.21-1983.02.23   2         1983.02.21-1983.02.23   2
1985.09.27-1985.10.01   2         1985.09.27-1985.10.01   2
1988.04.05-1988.04.08   2         1988.04.05-1988.04.08   2

Gap: [6, 10]
Tableau size: 1                   Tableau size: 1
1991.01.01-1991.01.08   2         1991.01.01-1991.01.08   2

Gap: [10, 20]
Tableau size: 6                   Tableau size: 6
1951.06.01-1951.06.12   2         1951.06.01-1951.06.12   2
1980.11.23-1980.12.09   2         1980.11.23-1980.12.09   2
1990.12.20-1991.01.01   2         1990.12.20-1991.01.01   2
1991.01.11-1991.01.23   2         1991.01.11-1991.01.23   2
1991.02.18-1991.03.01   2         1991.02.18-1991.03.01   2
1994.07.15-1994.07.27   2         1994.07.15-1994.07.27   2

Gap: [20, ∞)
Tableau size: 9                   Tableau size: 9
1945.11.29-1951.04.30   2         1945.11.29-1951.04.30   2
1969.12.15-1980.10.22   2         1969.12.15-1980.10.22   2
1991.10.02-1991.10.30   2         1991.10.02-1991.10.30   2
1993.10.19-1993.11.16   2         1993.10.19-1993.11.16   2
1994.02.03-1994.03.01   2         1994.02.03-1994.03.01   2
1995.05.10-1995.06.22   2         1995.05.10-1995.06.22   2

Note that the exact and approximate tableaux “latch onto” different endpoints. This was due to δ being set to 0.05, which meant that a confidence threshold of 0.995 was used for the exact tableau whereas effectively 0.995(1−0.05)/(1+0.05)≈0.9 was used for the approximate one. When we used δ=0.01 for the gap range [0, 2], the approximate tableau was the same as the exact one. Next, we identify time ranges over different scales over which no temperature data was recorded. A gap range of [2, 10] was used to find periods when the recording was discontinued for up to about ten days at a time, possibly due to malfunctioning equipment. A comparison of the tableau row start and end dates, as well as their associated supports, reveals that the exact and approximate tableaux were quite similar, and both indicate periods when no data was recorded. A gap range of [6, 10] helps identify a time-frame from 1991.01.01 to 1991.01.08 which has 6 days of missing data (since the support is 2, only the beginning point and the endpoint are present in the data). Similarly, [10, 20] returned 6 periods of moderate data loss, of 10 to 20 days at a time. In order to capture regions of long gaps, a gap range of [20,∞) was used. The first two patterns identify the two time periods of most significant loss: 1945 to 1951 and 1969 to 1980, when, according to the Wikipedia page for this airport, it was closed to the public.

Table 3 presents the sample tableaux for TRAFFICPOLLS.

TABLE 3 Tableaux for TRAFFICPOLLS

APPRXINTVL Tableau                            Support

Gap: [270, 330]
Tableau size: 2
2008-10-09, 05:17:06-2009-03-06, 19:10:17     39925
2008-04-22, 23:15:38-2008-05-25, 21:33:20     8683

Gap: [0, 150]
Tableau size: 751
2008-03-17, 21:07:02-2008-03-17, 21:08:56     2
2008-04-14, 16:00:23-2008-04-14, 16:00:23     2
2008-05-26, 05:05:31-2008-05-26, 05:05:31     2
2008-05-26, 06:17:43-2008-05-26, 06:17:47     2

Gap: [350, ∞)
Tableau size: 4001
2008-09-14, 13:18:38-2008-09-14, 14:59:24     14
2008-09-17, 07:33:51-2008-09-17, 09:06:06     11
2008-09-17, 01:22:46-2008-09-17, 02:11:02     7
2008-04-13, 03:48:21-2008-04-13, 05:32:38     6

The expected time gap between two successive polls is 5 minutes, or 300 seconds. Due to several factors, from traffic delays to clock synchronization, this exact periodicity will hardly ever be met. Therefore, we allow for ±30 seconds and use a gap of 270 seconds to 330 seconds. The gap range [270, 330] is satisfied by much of the data and gives a tableau size of two. Next, a gap range of [0, 150] was used to identify regions of extraneous polls. There are several instances of very short time differences between polls, but these tend to occur only briefly (one poll). A gap range of [350,∞) was then used to identify regions with heavily delayed or missing data, which, when correlated with other information collected by the ISP, helped solve problems with the data collection mechanism. Table 4 presents sample tableaux for different gap ranges on the DOWJONES data set.

TABLE 4 Tableaux for DOWJONES

APPRXINTVL Tableau      Support   APPRXINTVL Tableau      Support

Gap: [0, ∞)                       Gap: [0, 5]
Tableau size: 246                 Tableau size: 286
1949.06.07-1950.06.22   261       1949.06.07-1950.06.22   261
1904.05.17-1905.04.26   237       1904.05.17-1905.04.26   237
1921.08.15-1922.06.13   206       1921.08.15-1922.06.13   206
1953.09.18-1954.06.08   179       1953.09.18-1954.06.08   179
1942.07.28-1943.04.12   176       1942.07.28-1943.04.12   176
1925.03.24-1925.11.16   166       1925.03.24-1925.11.16   166
1915.05.19-1916.01.07   162       1915.05.19-1916.01.07   162
1898.09.30-1899.05.05   149       1898.09.30-1899.05.05   149
1958.04.11-1958.10.24   138       1935.03.14-1935.09.24   135
1935.03.14-1935.09.24   135       1945.07.26-1946.02.13   134

Gap: [50, 100]                    Gap: [100, ∞)
Tableau size: 45                  Tableau size: 4
2000.10.27-2000.11.08   9         2000.03.20-2000.03.28   7
1998.10.14-1998.10.23   8         2001.04.17-2001.04.18   2
1999.03.10-1999.03.18   7         2002.10.18-2002.10.21   2
2001.04.24-2001.05.01   6         2002.10.22-2002.10.23   2
2001.11.08-2001.11.19   6
2002.03.01-2002.03.08   6
2003.03.19-2003.03.26   6
1999.04.13-1999.04.16   4
1999.04.20-1999.04.23   4
1999.07.08-1999.07.13   4

Patterns for (0,∞) show time ranges over which the Dow Jones stock market exhibited an increasing trend with very high confidence of 0.995. The first few patterns for gap [0, 5] are similar to those of (0,∞). This implies that successive increases in stock market prices, particularly over long periods of time, are usually by small amounts which mostly lie within the small range of [0, 5]. Gaps [50, 100] and [100,∞) capture regions where the stock market average closing price increased rapidly. The resulting tableau suggests that sharp increases in stock prices were mostly observed during the late nineties and early years of the 21st century, probably due to the “dotcom boom” and “housing bubble”. Dow Jones data was chosen arbitrarily to demonstrate the disclosure. Any other investment data could be chosen.

Good tableau quality can be demonstrated by comparing EXACTINTVL and APPRXINTVL tableaux over a wide variety of č, ŝ and δ values. Since it is impractical to present actual tableaux for all the aforementioned cases, we use tableau size as a substitute for tableau quality and compare tableau sizes instead. FIG. 4 demonstrates the quality results of our approximate algorithms for DOWJONES data. For the gap range [0,∞), FIG. 4 compares the tableau sizes obtained from candidate intervals generated by EXACTINTVL and APPRXINTVL (using different δ's), as a function of confidence threshold č, using support ŝ=0.5. (Exact tableau assembly was used on the candidates from both methods.) The tableau sizes are quite similar for low values of č, and for high values of č with low values of δ. For example, at č=0.8 with δ=0.01, there was only a difference in size of 12. For the gap range [0, 5], FIG. 7 shows a greater sensitivity to δ, as there was a much more pronounced difference in tableau sizes at δ=0.05, but again for δ=0.01 the difference was relatively small.

In the previous experiments, although a desired confidence threshold of č was supplied, the algorithm relaxes this value to as low as ((1−δ)/(1+δ)) č to guarantee that all optimal candidate intervals are covered by some interval reported by the approximation algorithm. Hence, the tableau size is never larger, and may be smaller, than that of an optimal solution. However, if one does not wish to allow such “cheating on confidence”, then an alternative is as follows. We can “inflate” the desired confidence from č to min(1, ((1+δ)/(1−δ)) č), so that the relaxed confidence is now no less than č (but no greater than one), and thus not “cheat”. Of course, this may now result in larger tableaux than the optimal. As usual, the effect of this will depend on δ: when δ is small, č will only be inflated by a small amount and thus the tableau sizes will be closer to optimal. The trade-off is that the algorithm takes longer with smaller values of δ but, as we shall show in the next subsection when we investigate performance, even with relatively small values of δ there is a significant improvement over running the exact algorithm. FIGS. 5 and 8 compare the results of the following, with various δ-values:

OPTIMAL (EXACTINTVL with unmodified č),

apprx (APPRXINTVL with inflated č), and

optimal (EXACTINTVL with inflated č).

Note that the tableau size of OPTIMAL lower-bounds that of apprx, which lower-bounds that of optimal.

FIG. 5 assumes gaps of [0,∞), whereas FIG. 8 uses gaps of [0, 5]. In all cases, the tableau sizes for APPRXINTVL (with inflated č) are lower-bounded by OPTIMAL (with unmodified č). This implies that, in order to obtain an “exact” tableau (using EXACTINTVL) for a given č, one might as well assume an inflated confidence of ((1+δ)/(1−δ)) č (not exceeding 1) and obtain the same results using the much faster APPRXINTVL CANDGEN. Similar behavior was observed for other ŝ values. For higher ŝ, the absolute tableau sizes increase, as expected.
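The relaxation and inflation of the confidence threshold amount to simple arithmetic, sketched below. The function names are hypothetical, and the inflation factor is read here as (1+δ)/(1−δ), the reciprocal of the relaxation factor (1−δ)/(1+δ); that reading is an interpretation made for this sketch, chosen so that relaxing an inflated threshold returns at least the original č.

```python
def relaxed_confidence(c, delta):
    """Effective threshold after 'cheating': c scaled by (1 - delta) / (1 + delta)."""
    return c * (1.0 - delta) / (1.0 + delta)


def inflated_confidence(c, delta):
    """Threshold inflated by (1 + delta) / (1 - delta), capped at 1 (assumed reading)."""
    return min(1.0, c * (1.0 + delta) / (1.0 - delta))
```

With č=0.995 and δ=0.05, relaxed_confidence gives about 0.9002, consistent with the value of roughly 0.9 quoted for the approximate tableau, while inflated_confidence is capped at 1.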

FIGS. 6 and 9 show that the greedy set cover algorithm (GREEDYASMBL) gives similar sized tableaux compared to the optimum that dynamic programming (EXACTASMBL) obtains (the curves are almost indistinguishable) at a variety of support thresholds with N=10000. FIG. 6 assumes gaps of [0,∞) and č=0.63, whereas FIG. 9 uses gaps of [0, 5] and č=0.7. Both the exact and approximate TABASSMB algorithms ran on the same set of input candidate intervals, which were generated by APPRXINTVL. Similar figures were obtained for other values of č and other data sets.

The foregoing highlights the fact that tableaux generated by APPRXINTVL are close to those of EXACTINTVL. We now show that APPRXINTVL can generate accurate tableaux while still being faster than EXACTINTVL by orders of magnitude. For the sake of efficiency on large inputs, all tableau generation methods here use GREEDYASMBL for assembly; results are reported as the combined running time of the CANDGEN and TABASSMB phases.

FIGS. 10-15 compare the performance of APPRXINTVL (using various δ-values) with EXACTINTVL for different data set sizes. For the gap range [0,∞), FIG. 10 and FIG. 11 present results using DOWJONES and NETWORKFEEDS data, respectively. APPRXINTVL scales more gracefully, especially in FIG. 11, where the exhaustive algorithm was halted because it ran too long. FIG. 14 and FIG. 15 show results using WEATHERDATES with gaps in [4, 6] and TRAFFICPOLLS with gaps in [270, 330], respectively. As before, the APPRXINTVL algorithm results in substantial improvement over EXACTINTVL, particularly for large numbers of inputs, as can be seen in FIG. 15. While the performance is noticeably better with larger δ-values, in all cases APPRXINTVL is much faster (by orders of magnitude) than EXACTINTVL, even with very low values of δ (e.g., 0.01). FIGS. 12 and 13 separate out the performance comparison of the CANDGEN and TABASSMB phases, using DOWJONES data. FIG. 12 compares the running times of APPRXINTVL with EXACTINTVL and FIG. 13 compares GREEDYASMBL with EXACTASMBL. In FIG. 13, the curve for GREEDYASMBL is indistinguishable from the x-axis.

The examples given above involve numeric sequences. The underlying techniques may be applied to other forms of sequences, for example, alphabetic sequences, or alphanumeric sequences. Accordingly the expression alphabetic/numeric is used to include these alternative sequences.

In summary, one example of the technological data process described measures the fractional satisfiability of a given numeric sequence based on conformity with a predetermined constraint on the minimum and maximum difference between consecutive values in the sequence. Fundamentally, it involves inserting values into and/or deleting values from the sequence such that the new sequence will satisfy the predetermined constraints. The satisfiability factor of the data sequence is represented by the total number of such insertions and deletions in proportion to the sequence size. The resulting satisfiability factor may be used to evaluate the data stream in relation to other data sequences, or preset targets. If the satisfiability factor is below a given threshold, remedial measures may be taken to improve the quality of the data sequence.

Another example of the disclosure involves locating intervals of a sequence of numerical values with errors by testing different lengths at different starting positions to determine a satisfiability factor for each length, selectively choosing final intervals with a desired maximum satisfiability factor and summarizing these final intervals in a tableau. The intervals may be overlapping or non-overlapping.

As demonstrated by the examples given above, the data processes of the disclosure are implemented in electronic data processing devices and systems.

Various additional modifications of this disclosure will occur to those skilled in the art. All deviations from the specific teachings of this specification that basically rely on the principles and their equivalents through which the art has been advanced are properly considered within the scope of the disclosure as described and claimed.

To analyze large data streams for target anomalies, “sequential dependencies” (SDs) are studied for ordered data, and a framework is presented for discovering which subsets of the data obey a given sequential dependency. Given an interval G, a sequential dependency (SD) on attributes X and Y, written as X→_(G) Y, denotes that the distance between the Y-values of any two consecutive records, when sorted on X, is within G. SDs of the form X→_((0,∞)) Y and X→_((−∞,0]) Y specify that Y is strictly increasing and non-increasing, respectively, with X, and correspond to classical Order Dependencies (ODs). They are useful in data quality analysis (e.g., sequence numbers must be increasing over time) and data mining. SDs generalize ODs and can express other useful relationships between ordered attributes. An SD of the form sequence number→_([4,5]) time specifies that the time “gaps” between consecutive sequence numbers are between 4 and 5. In the context of data quality, SDs can measure the quality of service of a data feed that is expected to arrive with some frequency. In terms of data mining, the SD date→_([20,∞)) price identifies data streams wherein the data points rapidly increase from day to day by at least 20.
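As a concrete illustration of the definition, the sketch below checks an SD X→_(G) Y on a small set of records by sorting on X and testing each consecutive Y-gap against G. It is a minimal sketch: the function name sd_violations is hypothetical, records are assumed to be (x, y) pairs, and G is treated as a closed interval [g_low, g_high] (open endpoints such as (0,∞) would need a strict comparison).

```python
import math


def sd_violations(records, g_low, g_high):
    """Return the (x_prev, x_next) pairs whose Y-gap falls outside [g_low, g_high].

    records: iterable of (x, y) pairs; they are sorted on x before checking.
    """
    ordered = sorted(records, key=lambda r: r[0])
    violations = []
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        gap = y1 - y0  # distance between consecutive Y-values
        if not (g_low <= gap <= g_high):
            violations.append((x0, x1))
    return violations
```

For the records (1, 10), (2, 14), (3, 19), (4, 30) and G=[4, 5], the gaps are 4, 5 and 11, so only the last consecutive pair violates the dependency. Passing math.inf as g_high expresses an unbounded gap range such as [20,∞).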

In practice, even “clean” data may contain outliers. The degree of satisfaction of an SD by a given data set is evaluated via a confidence measure. Furthermore, real data sets, especially those with ordered attributes, are inherently heterogeneous, e.g., the frequency of a data feed varies with time of day, measure attributes fluctuate over time, etc. Thus, the SDs may be extended to Conditional Sequential Dependencies (CSDs), analogously to how Conditional Functional Dependencies (CFDs) extend traditional Functional Dependencies (FDs). A CSD consists of an underlying SD plus a representation of the subsets of the data that satisfy this SD. Similar to CFDs, the representation used here is a “tableau”, where the tableau rows are intervals on the ordered attributes.

To make sequential dependencies applicable to real-world data, the SD requirements may be relaxed and allowed to hold approximately (with some exceptions) and conditionally (on various subsets of the data). Thus, examples disclosed herein contemplate the use of conditional approximate sequential dependencies for discovering pattern tableaux, i.e., compact representations of the subsets of the data that satisfy the underlying dependency.

We claim:
 1. A method, comprising: modifying, via a processor, a firstnumber of values in a sequence of a data set to generate a modifiedsequence such that each difference between each pair of successivevalues is within a threshold; and determining, via the processor, asatisfiability metric for the modified sequence based on a relationshipbetween a number of modifications to the values in the sequence and asize of the sequence.
 2. The method of claim 1, wherein the sequencerepresents investment data.
 3. The method of claim 1, wherein thesequence represents traffic data.
 4. The method of claim 1, wherein thesequence represents weather data.
 5. The method of claim 1, wherein thesatisfiability metric represents a ratio between the number ofmodifications and the size of the sequence.
 6. The method of claim 1,wherein the sequence is a first sequence, the modified sequence is afirst modified sequence, the satisfiability metric is a firstsatisfiability metric, and further comprising: modifying a second numberof values in a second sequence of the data set to generate a secondmodified sequence; determining a second satisfiability metric for thesecond modified sequence based on a relationship between a number ofmodifications to values in the second sequence and a size of the secondsequence; and selecting one of the first or second sequences based on acomparison of the first satisfiability metric and the secondsatisfiability metric.
 7. The method of claim 6, wherein the firstsequence and the second sequence are subsets of the data set.
 8. Themethod of claim 6, further comprising summarizing the selection of thefirst sequence or second sequence in a table.
 9. The method of claim 6, wherein selecting one of the first sequence or the second sequence comprises determining which of the first satisfiability metric or the second satisfiability metric corresponds to a lesser number of modifications in proportion to the respective size of the first sequence and the second sequence.
 10. A machine readable memory comprising instructions which, when executed, cause a machine to perform operations comprising: modifying a first number of values in a sequence of a data set to generate a modified sequence such that each difference between each successive pair of values satisfies a threshold; and determining a satisfiability metric for the modified sequence based on a relationship between a number of modifications to the values in the sequence and a size of the sequence.
 11. The memory of claim 10, wherein determining the satisfiability metric comprises determining a ratio between the number of modifications and the size of the sequence.
 12. The memory of claim 10, wherein the sequence is a first sequence, the modified sequence is a first modified sequence, the satisfiability metric is a first satisfiability metric, and further comprising instructions which, when executed, cause the machine to perform operations comprising: modifying a second number of values in a second sequence of the data set to generate a second modified sequence; determining a second satisfiability metric for the second modified sequence based on a relationship between a number of modifications to values in the second sequence and a size of the second sequence; and selecting one of the first or second sequences based on a comparison of the first satisfiability metric and the second satisfiability metric.
 13. The memory of claim 12, wherein the first sequence and the second sequence are subsets of the data set.
 14. The memory of claim 12, further comprising summarizing the selection of the first sequence or second sequence in a table.
 15. The memory of claim 12, wherein selecting one of the first sequence or the second sequence comprises determining which of the first satisfiability metric or the second satisfiability metric corresponds to a lesser number of modifications in proportion to the respective size of the first sequence and the second sequence.
 16. An apparatus comprising: a memory comprising machine readable instructions; and a processor which, when executing the instructions, performs operations comprising: modifying a first number of values in a sequence of a data set to generate a modified sequence such that each difference between each successive pair of values meets a threshold; and determining a satisfiability metric for the modified sequence based on a relationship between a number of modifications to the values in the sequence and a size of the sequence.
 17. The apparatus of claim 16, wherein determining the satisfiability metric comprises determining a ratio between the number of modifications and the size of the sequence.
 18. The apparatus of claim 16, wherein the sequence is a first sequence, the modified sequence is a first modified sequence, the satisfiability metric is a first satisfiability metric, and the operations further comprise: modifying a second number of values in a second sequence of the data set to generate a second modified sequence; determining a second satisfiability metric for the second modified sequence based on a relationship between a number of modifications to values in the second sequence and a size of the second sequence; and selecting one of the first or second sequences based on a comparison of the first satisfiability metric and the second satisfiability metric.
 19. The apparatus of claim 18, wherein the first sequence and the second sequence are subsets of the data set.
 20. The apparatus of claim 18, wherein the operations further comprise determining which of the first satisfiability metric or the second satisfiability metric corresponds to a lesser number of modifications in proportion to the respective size of the first sequence and the second sequence.